Transcription start site heterogeneity and its role in RNA fate determination distinguish HIV-1 from other retroviruses and are mediated by core promoter elements

HIV-1 uses heterogeneous transcription start sites (TSSs) to generate two RNA 5’ isoforms that adopt radically different structures and perform distinct replication functions. Although these RNAs differ in length by only two bases, exclusively the shorter RNA is encapsidated while the longer RNA is excluded from virions and provides intracellular functions. The current study examined TSS usage and packaging selectivity for a broad range of retroviruses and found that heterogeneous TSS usage was a conserved feature of all tested HIV-1 strains, but all other retroviruses examined displayed unique TSSs. Phylogenetic comparisons and chimeric viruses’ properties provided evidence that this mechanism of RNA fate determination was an innovation of the HIV-1 lineage, with determinants mapping to core promoter elements. Fine-tuning differences between HIV-1 and HIV-2, which uses a unique TSS, implicated purine residue positioning plus a specific TSS-adjacent dinucleotide in specifying multiplicity of TSS usage. Based on these findings, HIV-1 expression constructs were generated that differed from the parental strain by only two point mutations yet each expressed only one of HIV-1’s two RNAs. Replication defects of the variant with only the presumptive founder TSS were less severe than those for the virus with only the secondary start site.

In the two HIV-1 strains tested thus far for heterogeneous start site use (one each from subtypes B and A), the TSS regions possess a trinucleotide motif consisting of three sequential guanosines, with the first and third recognized to initiate transcripts alternately with three or one 5' terminal guanosine [1,5]. Thus, the conservation of this motif was tested as a first step toward addressing TSS usage among lentiviruses.
TSS region sequences were analyzed for more than 3000 near full-length LTR sequences available in the Los Alamos HIV-1 Sequences Database (https://www.hiv.lanl.gov/content/index). Analyzed sequences included representatives of HIV-1 groups M, O, N, and P, plus several isolates of SIVcpz and SIVgor. The results of this analysis (Fig.1A) revealed that a large majority (approx. 99%) retained three sequential guanosines, flanked by additional conserved residues to yield a consensus sequence of TACT GGG TCTC. Thus, HIV-1's TSS region is highly conserved.
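The kind of conservation tally described above can be sketched as follows. This is a minimal illustration and not the analysis pipeline used in the study: the function names are hypothetical, the input is assumed to be pre-aligned, equal-length TSS-region windows, and the toy sequences are invented for the example rather than taken from the database.

```python
from collections import Counter


def fraction_with_motif(windows, motif="GGG"):
    """Fraction of TSS-region windows that contain the motif."""
    if not windows:
        raise ValueError("no sequences supplied")
    hits = sum(1 for w in windows if motif in w.upper())
    return hits / len(windows)


def column_consensus(aligned):
    """Most frequent base in each column of equal-length, pre-aligned windows."""
    aligned = [s.upper() for s in aligned]
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("windows must be aligned to equal length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


# Hypothetical toy windows (assumption: NOT real database entries).
toy_windows = ["TACTGGGTCTC", "TACTGGGTCTC", "TACTAGGTCTC", "TACTGGGTCTC"]

if __name__ == "__main__":
    print(fraction_with_motif(toy_windows))
    print(column_consensus(toy_windows))
```

On real data the windows would first have to be extracted and aligned around the TSS, a step that this sketch deliberately leaves out.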
To generate infected cell and virion RNA samples for these additional viruses, virus-specific host cell lines were chronically infected with RSV, RD114 or MLV viruses (see Materials and Methods), and CaDAL was used to determine precise RNA 5' ends (Fig. 1E). The results showed that all three of these simple retroviruses produced single RNA forms within cells, and that these same RNAs also were packaged into virions.
Sequencing CaDAL products confirmed that both RSV and MLV transcripts initiated at the G residues that had previously been implicated [26,27] (Fig. 1D). RD114 was determined to initiate RNA synthesis uniquely from the final G residue of the 4 G run at its TSS (Fig. 1D).
After finding that simple retroviruses differed from HIV-1 in TSS usage and packaging, and thus apparently in their strategies for defining genomic vs messenger RNA pools, heterogeneous TSS usage was then examined for additional lentiviruses. These included two viruses belonging to the HIV-2/SIVmac clade (HIV-2 ST and SIVmac239) plus the feline lentivirus FIV. As indicated in Figure 1F, TSS region sequences for these viruses differ from those conserved among HIV-1 strains but retain significant purine richness.
To map the 5' ends of these lentiviral RNAs, SIVmac239 and FIV cell and viral RNAs were harvested from chronically infected human and cat cell lines, respectively, while HIV-2 was transiently expressed in 293T cells (see Materials and Methods). RNA extracted from producer cells and viral particles was subjected to CaDAL (Fig. 1G).
The results showed that a single RNA 5' isoform was present in cells for each lentivirus, and the same RNA was encapsidated (Fig. 1G). Sequencing CaDAL products confirmed that all detectable transcription initiated from the first of two TSS-region purines for both HIV-2 and SIVmac239, yielding an RNA initiating in a 5' A, and that FIV transcription initiated exclusively from the fourth nucleotide (G) in the six-purine run at its TSS (Fig. 1F). Thus, neither heterogeneous TSS usage nor the use of distinct RNA species in packaging was observed for these non-HIV-1 lentiviruses.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. This version posted May 22, 2023; bioRxiv preprint doi: https://doi.org/10.1101/2023.05.22.541776
Maximum-likelihood phylogenetic analysis was then conducted to compare observed patterns of TSS usage to the phylogenetic relatedness of these viruses, using sequences spanning from the TATA element of the core promoter (positions -30 for HIV-1 and FIV, and -31 for HIV-2 (Fig. 2A)) through the gag start codon. The resulting phylogenetic tree was similar to previously published relationships for primate lentiviruses based on gag and pol sequences (Fig. 2B) [28], with all HIV-1 strains clustering together, a separate branch for the HIV-2 viruses, and FIV the most distantly related of the compared lentiviruses. Interestingly, the grouping of viruses on the phylogenetic tree coincides with heterogeneous vs unique TSS usage (Fig. 2B). This correlation suggested the possibility that the strategy of regulating RNA fates through heterogeneous TSS usage emerged after separation of the SIVcpz/HIV-1 and SIVsm/HIV-2 lineages [28] and before radiation of the SIVcpz/HIV-1 lineage into existing HIV-1 and HIV-1-related viruses.

Major determinants of TSS heterogeneity are located upstream of the TSS
Previous work demonstrated that altering HIV-1 TSS-region G stretch lengths shifts start site positions but maintains heterogeneous TSS usage, suggesting that start site choice is at least partially based on distance counting from upstream promoter elements [1]. These findings suggested that the major determinants of TSS heterogeneity were located upstream of the TSS. Thus, with knowledge that retroviruses other than HIV-1 displayed unique TSS usage, LTR chimeras were generated (Fig. 3A) to test if HIV-1 sequences upstream of the TSS were sufficient to confer heterogeneous TSS usage.
In the first chimera, U3 sequences from HIV-1 NL4-3 were appended upstream of MLV transcribed sequences, and the 5' ends of the intracellular RNAs this chimera produced were compared to those generated by MLV (Fig. 3B). Strikingly, whereas MLV RNAs produced from the MLV promoter displayed a unique TSS, two MLV RNA isoforms were observed in cells transfected with the chimera containing HIV-1 U3 sequences upstream of MLV transcribed sequences (Fig. 3B, lanes 2 and 3). Determining these RNAs' 5' ends revealed that their start sites' positioning and heterogeneity matched what would be predicted if the major determinants of HIV-1 TSS heterogeneity were located upstream of the TSS, in U3 sequences (Fig. 3A).
Two additional chimeras were generated to test whether MLV promoter sequences would generate unique TSS HIV-1 transcripts: one in which HIV-1 U3 sequences were replaced with MLV U3 at a spacing predicted to generate cap 3G RNAs exclusively, and a second designed to generate cap 1G RNAs if MLV U3 sequences dictated TSS multiplicity and positioning (Fig. 3A). RNAs produced by these vectors were analyzed by CaDAL (Fig. 3B). The results showed that HIV-1 RNAs transcribed from an MLV promoter each had a unique 5' end (Fig. 3B, lanes 5 and 6). Precise 5' ends were confirmed by sequencing CaDAL products.
Virus-based competition experiments were then performed to address whether or not the MLV-driven unique 5' end HIV-1 RNAs expressed from these latter chimeras performed the predicted functions (Fig. 3C). Pairwise combinations of HIV-1 Ψ+ helper plasmid and test vectors (either the NL4-3-based vector used in Fig. 1 or one of the two MLV/HIV-1 chimeras) were co-transfected into 293T cells. RNA was prepared from both transfected cells and the viral particles they released and subjected to an RNase protection assay using a labeled probe that protected helper and vector fragments of different lengths, and gel-separated products were analyzed by autoradiography (Fig. 3C).
The results indicated that as predicted, cap 1G RNAs were efficiently encapsidated but cap 3G RNAs were excluded from packaging under these competitive conditions.

HIV-1 heterogeneous TSSs do not result from tandem TATA-box use
As previously reported, the HIV-1 promoter contains a non-canonical TATA box with the sequence CATATAA [15]. This so-called CATA-box begins 29 bases upstream of the cap 3G RNA's TSS and is highly conserved among HIV-1 strains. Other lentiviruses, including those studied here, differ from HIV-1 in possessing canonical TATA-box sequences (Fig. 2A, 2B).
That the CATA-box sequences (at positions -30 to -24) provide HIV-1's TATA-box functions has been reported [15]. Nonetheless, because these sequences overlap from position -28 through -22 with sequences that contain a second TATA-like signal, it seemed conceivable that HIV-1's heterogeneous TSS usage might reflect alternating use of two overlapping TATA elements (Fig. 4A).
To examine contributions of these non-canonical core promoter elements to heterogeneous TSS usage, a point mutation was introduced to convert HIV-1's CATA-box into a TATA-box (Fig. 4A). 5' end analysis revealed that TATA-NL4-3 continued to produce both cap 1G and cap 3G RNAs, albeit in a different ratio than the parental vector.
Specifically, whereas about 70% of the parental vector's RNAs were of the cap 1G form, about 55% of the TATA-box mutant's RNAs had cap 3G ends (Fig. 4B, lanes 4 and 5).
Thus, single base conversion of the HIV-1 CATA-box into a canonical TATA-box did not ablate TSS usage heterogeneity but did modestly change RNA ratios.
To test the hypothesis that HIV-1's CATA-box region functioned as a tandem TATA-box that caused heterogeneous TSS usage, mutations to ablate one or the other candidate [...] (Fig. 4A). 5' end mapping showed that despite a potentially inactive second TATA-box, NL4-3 T-26A still produced two RNAs in a ratio similar to the parental vector (Fig. 4B, lane 6).
Additional mutants were examined to further test the possible second TATA-like element. To inactivate the CATA-box while retaining the putative secondary TATA, CA residues at positions -30 and -29 were replaced with GG, creating NL4-3-GGTATA. To inactivate the potential second TATA-box, positions -25 and -24 were substituted with GG bases, creating NL4-3-CATATGG (Fig. 4A). Examining these vectors' RNAs revealed that both maintained heterogeneous TSS usage (Fig. 4B). NL4-3-GGTATA produced cap 1G and cap 3G RNAs at levels similar to the parental vector, whereas NL4-3-CATATGG produced increased cap 3G RNAs (Fig. 4B, lanes 7 and 8). Thus, whereas some CATA-box mutations affected RNA ratios, none disrupted heterogeneous TSS usage, and results were inconsistent with the hypothesis that HIV-1's two TSSs reflect alternate use of two TATA elements.

TSS-proximal sequences are key determinants of heterogeneous TSS usage
Because HIV-2 uses a unique TSS, a series of HIV-1 vectors with chimeric HIV-1/HIV-2 promoters were created to further map determinants of heterogeneous transcription start site usage (Fig. 5A). Chimera 1 contained an intact HIV-2 U3 upstream of the HIV-1 GGG motif. Consistent with the findings above with MLV promoters upstream of HIV-1 sequences, placing the HIV-2 U3 promoter upstream of HIV-1 transcribed sequences resulted in the use of predominantly a single TSS, albeit with a small amount of residual secondary TSS use (Fig. 5B).
Additional chimeras were constructed to narrow down sequence elements important to multiple TSS use. Chimera 2 contained HIV-2 sequences upstream of the HIV-1 CATA-box followed by HIV-1 sequences, Chimera 3 had HIV-1 sequences upstream of HIV-2 TATA-box through -1 sequences, followed by HIV-1 transcribed sequences, and Chimera 4 was composed of HIV-1 sequences with a 15 bp fragment from HIV-2 replacing the CATA-box and adjacent region. Results with these chimeras (Fig. 5C) indicated that maintaining 15 bp of HIV-1 U3 sequences immediately adjacent to the TSS was sufficient to maintain dual TSS use, even when all upstream sequences were derived from HIV-2.
Previous work has described a conserved sequence element that flanks the TSSs of many Pol II promoters that display focal initiation. This motif is called the Initiator (Inr) and is believed to play a role in the precise positioning of Pol II at the TSS [8,16].
Interestingly, sequences flanking the HIV-2 TSS are a perfect match to the Inr consensus, BBCA+1BW (Fig.5D). In contrast, sequence at the HIV-1 TSS bears no resemblance to the Inr consensus, although sequences in this region have been implicated as functionally important to HIV-1 transcriptional activation [32,33].
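Checking a TSS-flanking window against the Inr consensus can be sketched as below. This is an illustrative helper and not the authors' method. The IUPAC codes are standard (B = C, G or T; W = A or T), but the function names are hypothetical. The HIV-2-like toy windows use the T-2, C-1, A+1 and G+2 residues named in the text, while their outermost bases are invented placeholders. The HIV-1 window is taken from the TACT GGG TCTC consensus with the first G treated as +1.

```python
import re

# Standard IUPAC nucleotide codes needed for the Inr consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "B": "[CGT]", "W": "[AT]"}

INR_CONSENSUS = "BBCABW"  # BBCA(+1)BW with the +1 marker removed
TSS_OFFSET = 3            # index of the +1 position within the consensus


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[code] for code in consensus)


def matches_inr(seq, tss_index):
    """True if the six bases from -3 to +3 around tss_index fit the Inr consensus."""
    start = tss_index - TSS_OFFSET
    if start < 0:
        return False
    window = seq[start:start + len(INR_CONSENSUS)].upper()
    return re.fullmatch(iupac_to_regex(INR_CONSENSUS), window) is not None


# Toy windows: inner residues follow the text, outer bases are placeholders.
hiv2_like = "GTCAGT"      # T-2, C-1, A+1, G+2
t_to_g_mutant = "GGCAGT"  # T-2 changed to G
tc_inversion = "GCTAGT"   # TC upstream of the TSS inverted to CT

if __name__ == "__main__":
    for name, seq in [("HIV-2-like", hiv2_like),
                      ("T-2G", t_to_g_mutant),
                      ("TC inversion", tc_inversion)]:
        print(name, matches_inr(seq, 3))
    print("HIV-1 consensus", matches_inr("TACTGGGTCTC", 4))
```

Under these assumptions the T-to-G change still fits the consensus, whereas the TC inversion does not, because it removes the C at position -1.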
To test whether HIV-2's Inr was responsible for its unique TSS usage, two HIV-2 Inr region derivatives, both with substitutions based on HIV-1 sequences, were designed and their transcripts' 5' ends determined (Fig. 5E and F). In one of these mutants, a T residue two bases upstream of the HIV-2 TSS was mutated to a G. This substitution is compatible with the canonical Inr sequence but introduced a purine as a potential secondary TSS at the -2 position, 29 bp downstream from the TATA-box. In the second mutant, the TC dinucleotide immediately upstream of the HIV-2 TSS was inverted to generate a CT, as is present in the corresponding position of HIV-1. This latter change is predicted to disrupt the Inr, because a C at position -1 is one of the most highly conserved Inr residues [16].
Examining intracellular RNAs revealed that both Inr region changes converted the HIV-2 U3 from a focal promoter to one that displayed heterogeneous TSSs (Fig. 5F). The -2 position G substitution maintained Inr homology but G-2 was used as a secondary TSS in addition to the parental A+1 start, and an unanticipated third TSS also was observed that mapped to G+2 (Fig. 9C, lane 3). In the Inr-disrupting TC inversion mutant (Fig. 9C, lane 4), both A+1 and G+2 were used for transcription initiation, with a surprising third RNA isoform starting with a pyrimidine at T-1. These data showed that changes to the two nucleotides just upstream of the HIV-2 TSS, which introduced residues from the analogous region of HIV-1, resulted in heterogeneous TSS usage. However, the role of these two nucleotides does not appear to reflect Inr functions, because both changes to [...]

TSS proximal dinucleotide and core promoter element spacing focus TSS
The distance from the HIV-2 TATA-box to its +1 position is 31 bp, while the distance from the HIV-1 CATA box to +1a TSS is 29 bp and to +1b is 31 bp. Thus, published observations that reported TSSs shifted when the number of HIV-1 TSS G residues changed [1], paired with the fact that most eukaryotic Pol II transcripts begin with a purine [7,9] raised the possibility that HIV-1 TSS choice was due primarily to core promoter element spacing relative to purine residues rather than to any specific properties of TSS-adjacent sequences.
To address this, promoter element spacing mutants were generated (Fig. 6). First, two dinucleotide insertion mutants were constructed that increased the distance between HIV-1's CATA-box and its TSS (Fig. 6A). Both retained HIV-1's purine-rich GGG trinucleotide motif and increased spacing to the CATA-box to 31 bp. However, one mutant was lengthened by a two bp duplication about 20 bp upstream of the TSS while the other was lengthened by an insertion of two residues from the corresponding region of HIV-2 immediately upstream of the GGG motif.
Determining TSS usage for these (Fig. 6B) showed that the duplicated AG bases at positions -21 and -20 maintained heterogeneous TSS usage and added a third RNA start three bases upstream of +1a (Fig. 6B, lane 5). Both TSSs of the parental NL4-3 vector were also still used, but the major RNA form shifted from cap 1G to cap 3G. These results suggested that nucleotide distance counting from the CATA-box contributed to specifying which residues served as HIV-1's TSSs but was not the major determinant of heterogeneous TSS usage. In contrast, the promoter lengthened by the insertion of an HIV-2 derived TC dinucleotide immediately upstream of the GGG motif produced no cap 1G RNA. This latter insertion converted the HIV-1 promoter into a focused promoter that almost exclusively produced cap 3G RNAs (Fig. 6B, lane 6).
Based on these observations, an additional variant was constructed that reverted promoter element spacing to that of HIV-1 but maintained the HIV-2 derived dinucleotide as a two base substitution at HIV-1 positions +1a and +1 (Fig. 6A and 6B, lane 7). Results with this variant indicated that introducing two substitutions into HIV-1's TSS region was sufficient to convert HIV-1 from heterogeneous TSS use to initiating transcription at a unique position.
These final two mutations (one a two-base insertion, and the second a two-base substitution) were thus each sufficient to convert HIV-1 from a virus using heterogeneous TSSs to one using a unique TSS: the first expressing only cap 3G RNAs and the second only cap 1G RNAs. To test their replication properties, these two mutant promoter regions were incorporated into both LTRs of a fully infectious HIV-1 NL4-3 molecular clone. Virions were produced by transfecting 293T cells, normalized by RT levels, and used to infect CEM-SS cells. The results of monitoring virus spread over time (Fig. 6C) indicated that the cap 1G-only virus had delayed replication kinetics compared to parental NL4-3 (Fig. 6C), although replication recovered and virus production reached parental NL4-3 levels after four weeks of passage. In contrast, the cap 3G-only mutant virus showed significant replication deficiencies. Its very low virus production persisted throughout the four-week experiment (Fig. 6C).
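The distance-counting hypothesis examined in this section can be made concrete with a toy model, sketched below. This is an illustration only and not an analysis from the study. It assumes that any purine lying 29 to 31 bases downstream of the first base of the CATA-box is a candidate start, a window chosen here from the distances quoted above. The spacer between the box and the TSS region is a hypothetical run of placeholder pyrimidines, and the function name is invented.

```python
PURINES = {"A", "G"}


def candidate_starts(seq, box_start, min_dist=29, max_dist=31):
    """Indices of purines lying min_dist..max_dist bases downstream of box_start."""
    seq = seq.upper()
    lo = box_start + min_dist
    hi = min(box_start + max_dist, len(seq) - 1)
    return [i for i in range(lo, hi + 1) if seq[i] in PURINES]


# Toy promoters: CATA-box, placeholder spacer, then the consensus TSS region.
FILLER = "C" * 18  # hypothetical spacer, NOT real HIV-1 sequence
parental = "CATATAA" + FILLER + "TACT" + "GGG" + "TCTC"
tc_insertion = "CATATAA" + FILLER + "TACT" + "TC" + "GGG" + "TCTC"

if __name__ == "__main__":
    print("parental:", candidate_starts(parental, box_start=0))
    print("TC insertion:", candidate_starts(tc_insertion, box_start=0))
```

In this toy model the parental promoter offers all three guanosines as candidates, while the TC insertion leaves only the first G of the motif inside the window, in line with the cap 3G-only phenotype described above. The model also flags the middle G, and it does not account for the AG duplication mutant, so it is at best a partial description of start site choice.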

Discussion
HIV-1 subtypes A and B have previously been shown to initiate transcription at two distinct template positions and generate two major RNA isoforms: cap 1G and cap 3G [1,2]. These RNAs adopt distinct folded structures that differ in their abilities to support specific replication processes, and thus heterogeneous TSS usage regulates HIV-1 RNA fates. The two HIV-1 TSSs consist of the first and the third residues in a GGG motif located in the viral LTR, with the cap 1G RNA form serving as virion genomic RNA and cap 3G RNAs translated to yield viral proteins. Due to its GGG trinucleotide motif and two TSSs separated by one nucleotide, the HIV-1 promoter resembles the so-called twin-TSS promoters first characterized from a high-throughput screen of mammalian transcripts [9].
Here, database analysis showed that the TSS motif is highly conserved among HIV-1 family members. This prompted an experimental analysis of TSS usage in additional HIV-1 strains and a broader range of retroviruses. The results showed that all tested HIV-1 strains, including several different subtypes and a group O isolate, produced two RNA forms in cells, only one of which was observed in viral particles. Thus, the control [...] In contrast, all other retroviruses tested displayed only a single 5' end isoform, both in cells and in virions. This was true for the simple retroviruses RSV, RD114, and MLV as well as for the lentiviruses HIV-2, SIVmac, and FIV. RSV and MLV have previously been found to use single TSSs based on virion RNA sequence analysis [26,27].
Another report described some heterogeneity at the 5' end of MLV RNA extracted from viral particles or transcribed in cell lysates, but these were assayed by S1 nuclease protection and the authors concluded that there may have been partial DNA probe protection by the RNA 5' cap [34]. Pol II transcription generally initiates at purine residues [7,9], and because some of the viruses tested here have polypurine runs in their TSS regions (GGGG in RD114, or GAGGAG in FIV) it seemed possible that they might employ more than one TSS. However, the cap-dependent adapter ligation/PCR assay used here revealed that each of these used a unique TSS, and that the same single RNA form was observed in both cells and virions.
Thus, TSS heterogeneity is not an obligatory feature of retroviral replication but instead one that distinguishes HIV-1 from all other tested retroviruses, suggesting that other retroviruses must use strategies different from HIV-1's to control their unspliced RNAs' functions. As a possible example of this, recently it was shown that the cytoplasmic fates of full-length MLV RNA could be determined by which nuclear export pathway and nuclear export factors were recruited to viral RNA [35]. Specifically, it has been reported that binding of NXF1 and SRp20 to full-length MLV RNA drives the RNA through NXF1-dependent nuclear export to polysomes [36], whereas interaction with CRM1 drives viral RNA to virion assembly sites through CRM1-dependent nuclear export [35].
Interestingly, while this article was in preparation, another group examined TSS usage in lentiviruses [37]. That paper confirmed heterogeneous TSS usage in multiple HIV-1 strains but differed from the current study in its HIV-2 conclusions, which indicated that HIV-2 displays heterogeneous TSSs [37]. We have not determined the cause of this discrepancy but note that while HIV-2 appeared to use a single TSS here, we did observe some TSS heterogeneity when HIV-1 sequences were expressed using the HIV-2 U3 promoter (Chimera 1 in Fig. 5). Additionally, our two reports employed different technologies to map RNA 5' ends. To reduce artifacts and enhance specificity, researchers who perform genome-wide analyses of transcription start sites have now largely replaced the 5' RACE approach used in the other study with CAGE (cap analysis gene expression) or other cap-dependent approaches like those used in our study [38].
Here, the determinants of HIV-1 heterogeneous TSS usage were mapped using targeted mutations and promoter chimeras. Previous work had shown that introducing mutations into the HIV-1 GGG motif did not affect heterogeneous TSS usage [1], and work here confirmed that the naturally arising TSS variant AGG also exhibited heterogeneous TSS usage. Thus, because determinants of TSS use appeared not to reside in the start sites themselves, promoter swaps were constructed to test if the determinants were located upstream or downstream of HIV-1's GGG motif. Analyzing RNA 5' ends revealed that when MLV transcripts were produced using an HIV-1 promoter, two major TSSs separated by one base were employed. In contrast, HIV-1 RNAs produced using the MLV U3 promoter displayed only one TSS, independent of the number of TSS proximal purines.
Having mapped determinants to upstream of the TSS, additional chimeras were generated to test the influence of specific promoter elements. The best characterized core promoter element is the TATA-box, with a consensus sequence of TATAWAAR that is located between -33/-26 to -27/-18 upstream of the TSS [7,9,13]. Many Pol II promoters, including many that support single or twin TSS use, do not have TATA-boxes [8,9]. HIV-1 contains a non-canonical TATA-box with the sequence CATATAA, which has been dubbed a CATA-box [15]. CATA-box sequences are highly conserved among HIV-1s and found in a limited number of human promoters, including those for β-globin and IL1B [14]. CATA-boxes are associated with lower levels of transcriptional activity [29,30], lower TBP binding affinity [14,29], and less stable TFIIA-TBP-DNA complexes [39,40]. Previously it was shown that converting the CATA-box into a canonical TATA-box led to elevated levels of HIV-1 promoter expression and enhanced fitness of the virus in chronically infected cell culture [15].
Examining the sequence of HIV-1's CATA-box suggested the possibility that two overlapping TATA-like elements in the HIV-1 core promoter might be responsible for dual TSS use. However, a series of CATA-box region mutants revealed no evidence for [...] Additional work indicated that spacing between the CATA/TATA-box and TSS is an important, but not the only, factor in defining lentiviral TSSs. Start site selection through distance counting is characteristic of TATA-box containing promoters [9,41]. In HIV-1, the +1a and +1b TSSs are located 29 and 31 bp from the CATA-box, respectively.
Here, one 2 bp insertion between the CATA-box and TSS changed the ratios of HIV-1 cap 1G and cap 3G RNAs and led to the appearance of a third RNA that initiated at the first purine upstream of the original TSS. However, whereas observations with this mutant suggested that distance counting specified start site selection, this mechanism appeared to be overridden in a different mutant, in which the CATA-TSS distance was lengthened by a two-nucleotide insertion directly upstream of the GGG motif.
Observations with this latter mutant suggested that TSS-proximal residues play a dominant role in determining whether or not TSS heterogeneity is observed.
These TSS-proximal sequences reside in the same location as an element involved in some promoters' TSS selection. The so-called Initiator (Inr) element overlaps the TSS and may function in the precise positioning of Pol II at the TSS [8,16]. Most focused promoters regulated by Inr do not contain TATA-boxes, although a significant proportion contain both a TATA-box and Inr [8,16]. Promoters containing both TATA-box and Inr elements differ from those containing only one of these elements in terms of promoter strength [42] and responses to transcription factors such as NC2 [43] and HMGA1 [44].
HIV-1 TSS-flanking sequences, mapped alternately to positions -6 to +8 [32] or -6 to +11 [33], are important to transcriptional activation, but HIV-1 lacks the BBCA+1BW Inr consensus sequence [16]. In contrast, HIV-2's single TSS is flanked by a perfect match to the Inr consensus.
To test if the presence or absence of an Inr dictated TSS heterogeneity, HIV-1 / HIV-2 core promoter chimeras were tested. Because one mutation that disrupted HIV-2's perfect Inr consensus converted it from a focused promoter into one that displayed heterogeneous starts, initial results suggested that the presence or absence of a canonical Inr element might dictate TSS heterogeneity. However, subsequent experimentation showed that whereas determinants reside in the same genetic interval as Inr elements, the regulation of lentivirus TSS heterogeneity appears mechanistically distinct from Inr-mediated regulation.
The observations here that all retroviruses other than HIV-1 use unique transcription start sites suggest that HIV-1's dependence on two TSSs is a relatively recent acquisition. Results of the promoter element spacing experiments above suggested that the cap 1G TSS is HIV-1's ancestral TSS and that the TSS that generates cap 3G RNA is a newer acquisition. This notion is consistent with experimental observations that cap 1G RNAs can function both in genomic RNA packaging and as mRNAs, whereas cap 3G RNAs are limited to intracellular roles and excluded from packaging [1,3].
Similarly, the initial infectivity studies performed here show that neither variant that expresses only one HIV-1 RNA is fully infectious but that replication of the cap 3G-only [...]

RNA extraction, CaDAL assay, Sanger sequencing, RNAse protection assay.
Virus particles were collected by centrifuging filtered culture media through a 20% sucrose cushion at 25,000 rpm for 2 hrs. Virus pellet and cell RNA was isolated using TRIzol according to the manufacturer's protocol (Ambion) and DNase treated. 5′ ends of capped viral RNAs were analyzed using a cap-dependent adaptor ligation/PCR (CaDAL) assay [3,24]. Briefly, RNA extracted from viral particles or virus producing cells was used as a template for cDNA synthesis using the TeloPrime Full-Length cDNA Amplification Kit V2 (Lexogen) according to the manufacturer's protocol. Each cDNA [...]
RNase protection assay (RPA) was performed as previously described [6]. The riboprobe HIVgag/CMV used in this study was described previously [6] and targets a 201-nt fragment of gag unique to the NL4-3 GPP helper and 289 nt of CMV promoter sequence from the puro-expressing cassette unique to test vectors.
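Once CaDAL products have been sequenced, assigning each read to an RNA 5' isoform amounts to counting its leading guanosines. The sketch below illustrates this for HIV-1 under stated assumptions and is not part of the published protocol: reads are assumed to be adapter-trimmed and to begin at the RNA 5' end, the TCTC anchor is taken from the TACT GGG TCTC consensus, and the function names and toy reads are hypothetical.

```python
from collections import Counter


def classify_five_prime(read, anchor="TCTC"):
    """Label a read by its number of 5' G residues preceding the anchor."""
    read = read.upper()
    n_g = len(read) - len(read.lstrip("G"))
    if n_g == 0 or not read[n_g:].startswith(anchor):
        return "unassigned"
    return f"cap {n_g}G"


def tally_isoforms(reads):
    """Count the reads assigned to each 5' isoform label."""
    return Counter(classify_five_prime(r) for r in reads)


# Hypothetical toy reads (assumption: NOT experimental data).
toy_reads = ["GGGTCTCTCTGG", "GTCTCTCTGG", "GTCTCTCTGG", "ATCTCTCTGG"]

if __name__ == "__main__":
    for label, count in sorted(tally_isoforms(toy_reads).items()):
        print(label, count)
```

Reads that do not begin with a G followed directly by the anchor are left unassigned rather than forced into one of the two classes.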

Competing Interest Statement
The authors declare no competing interests.
Acknowledgments
[...] at that position measured in bits, whereas the height of symbols within the stack reflects the relative frequency of the corresponding base at that position [45]. Note that HXB2 (accession # K03455.1) GenBank coordinates indicate HIV-1 mRNA starts at the second G of the GGG motif, at LTR position 455, but more recent studies indicate 455 is the least commonly used TSS in the GGG motif [1,2]. Thus, to preserve HXB2 numbering while reflecting experimentally validated start sites, +1 was used to indicate the beginning of viral RNA according to GenBank and the two major experimentally [...]